Abstract

Finding statistically significant high-order interaction features in predictive modeling is an important
but challenging task. The difficulty lies in the fact that, for recent applications with high-dimensional
covariates, the number of possible high-order interaction features is extremely large. Identifying
statistically significant features from such a huge pool of candidates is highly challenging both
computationally and statistically. To address this problem, we consider a two-stage algorithm in which
we first select a set of high-order interaction features by marginal screening, and then make statistical
inferences on the regression model fitted only with the selected features. Such statistical inferences are
called post-selection inference (PSI), and have been receiving increasing attention in the literature. One of
the seminal recent advancements in the PSI literature is the work of Lee et al. [1, 2], where the authors
presented an algorithmic framework for computing exact sampling distributions in PSI. A main challenge
when applying their approach to our high-order interaction models is to cope with the fact that PSI in
general depends not only on the selected features but also on the unselected features, making it hard
to apply to our extremely high-dimensional high-order interaction models. The goal of this paper is to
overcome this difficulty by introducing a novel efficient method for PSI. Our key idea is to exploit the
underlying tree structure among high-order interaction features, and to develop a pruning method of
the tree which enables us to quickly identify a group of unselected features that are guaranteed to have
no influence on PSI. The experimental results indicate that the proposed method allows us to reliably
identify statistically significant high-order interaction features with reasonable computational cost.
1 Introduction


Finding statistically reliable high-order interaction features that have significant effects on the response is
valuable in many regression problems. For example, in biomedical studies, it is well known that genetic
factors such as single genes do not work independently. When a regression analysis is used in biomedical
studies for predicting a certain phenotype such as drug response, high-order interactions of multiple genetic
factors might be useful [3, 4]. If one has a data set with $d$ original covariates and takes into account interaction
terms up to order $r$, the regression model has $D := \sum_{\rho=1}^{r} \binom{d}{\rho}$ features. Unless both $d$ and $r$ are fairly small,
the number of features $D$ is far greater than the sample size $n$. Statistical inference on such an
extremely high-dimensional regression model is quite challenging.

A common practical approach to high-dimensional regression problems is a two-stage method, where a
subset of features is first selected, and then a regression model only with the selected features is fitted. A
statistical issue of such a two-stage method is how to incorporate the effect of the feature selection stage on
the statistical inference of the final regression model. If the two stages are performed with the same data set,
confidence intervals or p-values on the final regression model would be positively biased. Statistical inference
conditional on pre-feature selection is often called post-selection inference (PSI). Until recently, PSI has been
recognized to be intractable in most cases because it seems to be difficult to derive a sampling distribution
that can fully account for a complex feature selection process [5, 6]. Recently, Lee et al. [1, 2] introduced an
affirmative solution to PSI for a wide class of feature selection methods. Specifically, they provided a general
algorithm for computing an exact sampling distribution of the response conditional on a feature selection
event which is represented by a set of affine constraints in the response domain. A notable advantage of
their finding is that many commonly used feature selection algorithms such as marginal screening, orthogonal
matching pursuit, and Lasso belong to this class. Using the sampling distribution of the response conditional
on a feature selection event, one can make various statistical inferences on the post-regression model that
properly incorporate the effect of pre-feature selection.

The goal of this paper is to develop a method for finding statistically significant high-order interaction
features by using the idea of Lee et al. [1, 2]. Unfortunately, their method cannot be directly applied to our
extremely high-dimensional regression model with high-order interaction features. The difficulty lies in the
simple fact that a feature selection event in general depends not only on the selected features but also on the
unselected features. This suggests that at least $O(D)$ constraints would be needed for characterizing a feature
selection event. Since the number of features $D$ is extremely large in our high-order interaction model, it
would be computationally intractable to work with all those constraints. In this paper we mainly study PSI
on high-order interaction models with marginal screening-based pre-feature selection. In marginal screening,
we select the top $k$ features from all the $D$ features according to the association of each feature with the response.
Despite its simplicity, marginal screening is one of the most frequently used feature selection methods, and it
has been shown to have several desirable statistical properties under some regularity conditions [7, 8, 9, 10].
As we describe in the next section, a feature selection event by marginal screening is characterized by a set
of $2k\bar{k} + k$ affine constraints in the response domain, where $\bar{k} := D - k$ is the number of unselected features.
This suggests that the sampling distribution of the response conditional on the marginal screening depends in
general on these $2k\bar{k} + k$ affine constraints.

Our main contribution in this paper is to develop a novel algorithm that can efficiently find a subset
of these $2k\bar{k} + k$ affine constraints which are guaranteed to have no influence on the conditional sampling
distribution. Our basic idea is to exploit the underlying tree structure among a set of high-order interaction
features (see Figure 1). Specifically, we derive an efficient pruning condition of the tree such that, for any
node in the tree, if a certain condition on the node is satisfied, then all the features corresponding to its
descendant nodes are shown to have no influence on the conditional sampling distribution. As demonstrated
in the experiment section, our algorithm allows us to work with PSI for a high-order interaction model
with, e.g., $d = 5000$ and $r = 5$, where the number of all the high-order interaction features $D$ is greater than
$10^{16}$.
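As a quick sanity check of the feature counts quoted above, the following minimal sketch evaluates $D = \sum_{\rho=1}^{r} \binom{d}{\rho}$ directly. The function name `num_interaction_features` is ours and is used only for illustration.

```python
from math import comb

def num_interaction_features(d: int, r: int) -> int:
    """Number of interaction features of order 1..r built from d covariates."""
    return sum(comb(d, rho) for rho in range(1, r + 1))

# Small case matching the setting of Figure 1 (d = 4, r = 3).
small = num_interaction_features(4, 3)
# Large case used in the experiments (d = 5000, r = 5).
large = num_interaction_features(5000, 5)
print(small, large > 10**16)
```

Because Python integers have arbitrary precision, the count for $d = 5000$, $r = 5$ is exact rather than a floating-point approximation.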


2 Preliminaries 


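Problem setup Consider modeling a relationship between a response $Y \in \mathbb{R}$ and $d$-dimensional covariates
$z = [z_1, \ldots, z_d]^\top$ by the following high-order interaction model up to order $r$:

$$ Y = \sum_{j_1 \in [d]} \alpha_{j_1} z_{j_1} + \sum_{\substack{(j_1, j_2) \in [d] \times [d] \\ j_1 \neq j_2}} \alpha_{j_1, j_2} z_{j_1} z_{j_2} + \cdots + \sum_{\substack{(j_1, \ldots, j_r) \in [d]^r \\ j_1 \neq \cdots \neq j_r}} \alpha_{j_1, \ldots, j_r} z_{j_1} \cdots z_{j_r} + \varepsilon, \tag{1} $$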

where the $\alpha$s are the coefficients and $\varepsilon$ is random noise. We assume that each original covariate $z_j$, $j \in [d]$,
is defined in a domain $[0,1]$ where values 1 and 0 respectively indicate the existence and the non-existence
of a certain property, and values between them indicate the "degree" of existence. High-order interaction
features thus represent co-existence of multiple properties. For example, if $z_{j_1}$ represents high body mass
index (BMI) and $z_{j_2}$ represents a mutation in a certain gene, we may code these two covariates as

$$ z_{j_1} := \begin{cases} 1 & \text{if BMI} > 30, \\ (\text{BMI} - 15)/(30 - 15) & \text{if BMI} \in [15, 30], \\ 0 & \text{if BMI} < 15, \end{cases} \qquad z_{j_2} := \begin{cases} 1 & \text{if there is a mutation,} \\ 0 & \text{if there is no mutation.} \end{cases} $$

Then, an interaction term $z_{j_1} z_{j_2}$ represents the co-existence of high BMI and a mutation in the gene.

The high-order interaction model (1) has in total $D := \sum_{\rho \in [r]} \binom{d}{\rho}$ features. Let us write the mapping from
the original covariates $z := [z_1, \ldots, z_d]^\top \in [0,1]^d$ to the high-order interaction features $x := [x_1, \ldots, x_D]^\top \in [0,1]^D$
as $\phi : [0,1]^d \to [0,1]^D$, $z \mapsto x$, where the latter is defined as

$$ x := \phi(z) = [z_1, \ldots, z_d, z_1 z_2, \ldots, z_{d-1} z_d, \ldots, z_1 \cdots z_r, \ldots, z_{d-r+1} \cdots z_d]^\top \in [0,1]^D. \tag{2} $$

Since a high-order interaction feature is a product of original covariates defined in $[0,1]$, the range of each
feature $x_j$, $j \in [D]$, is also $[0,1]$.
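The mapping from the original covariates to the interaction features can be sketched in a few lines. This is only an illustration for small $d$ (the function name `interaction_features` is hypothetical); it enumerates every index subset explicitly, which is exactly what becomes infeasible when $D$ is large.

```python
from itertools import combinations
from math import prod

def interaction_features(z, r):
    """Map z in [0,1]^d to all products of up to r distinct covariates.

    Returns the list of index tuples and the matching feature values,
    ordered by interaction order and then lexicographically.
    """
    d = len(z)
    index_sets = [c for rho in range(1, r + 1) for c in combinations(range(d), rho)]
    x = [prod(z[i] for i in c) for c in index_sets]
    return index_sets, x

index_sets, x = interaction_features([1.0, 0.5, 0.0], r=2)
print(index_sets)
print(x)
```

Since every factor lies in $[0,1]$, each product also lies in $[0,1]$, and the value of a feature can never exceed the value of any feature built from a subset of its indices. This monotonicity is what the tree-based pruning later relies on.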
Our goal is to identify statistically significant high-order interaction terms that have large impacts on
the response $Y$ by identifying regression coefficients $\alpha$ that deviate significantly from zero. However,
unless both $d$ and $r$ are fairly small, the number of coefficients $\alpha$ to be fitted is far greater than
the sample size $n$, meaning that a unique least-squares solution does not exist, and traditional least-squares
estimation theory cannot be used for making statistical inferences on the fitted model. We thus introduce a PSI
framework where a subset of features is first selected by marginal screening, and then statistical inferences
on the model fitted only with the selected features are considered.

Post-selection inference with marginal screening In the high-order interaction feature domain $[0,1]^D$,
we consider the same problem setup as Lee et al. [1, 2]. We assume that the data is generated from
the following process

$$ y \sim N(\mu, \Sigma), \tag{3} $$

where $y \in \mathbb{R}^n$ is a random response vector that is Normally distributed with mean vector $\mu \in \mathbb{R}^n$ and
variance-covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$. The mean vector $\mu$ in general depends on the fixed (non-random)
design matrix $X \in [0,1]^{n \times D}$. The training set is denoted as $(X, y)$ where $y \in \mathbb{R}^n$ is an observed response
from the data generating process (3). The training set is also denoted as $\{(x^i, y^i)\}_{i \in [n]}$ where $x^i \in [0,1]^D$
is the $i$-th row of $X$ and $y^i \in \mathbb{R}$ is the $i$-th element of $y$. Similarly, the $j$-th column of $X$ is denoted as $x_j$ for
$j \in [D]$.

Marginal screening In the first stage, we select the top $k$ features that have strong association with the
response. Noting that each feature is defined in $[0,1]$ and the value indicates (the degree of) the existence of
a certain property, we consider a score $x_j^\top y$ for each of the $D$ features, and select the $k$ top features according
to their absolute scores $\{|x_j^\top y|\}_{j \in [D]}$.

We denote the index set of the selected $k$ features by $S$, and that of the unselected $\bar{k} := D - k$ features
by $\bar{S}$. As pointed out in [1], a marginal screening event is characterized by a set of affine constraints. The fact
that the $k$ features in $S$ are selected and the $\bar{k}$ features in $\bar{S}$ are not selected is rephrased as

$$ |x_j^\top y| \ge |x_\ell^\top y| \quad \text{for all } (j, \ell) \in S \times \bar{S}. \tag{4} $$

Let $s_j := \mathrm{sign}(x_j^\top y)$, $j \in S$. Then the feature selection event in (4) is rewritten with the sign constraints of
the selected features by the following $2k\bar{k} + k$ constraints

$$ (-s_j x_j - x_\ell)^\top y \le 0, \quad (-s_j x_j + x_\ell)^\top y \le 0, \quad -s_j x_j^\top y \le 0 \quad \forall (j, \ell) \in S \times \bar{S}. \tag{5} $$

Since the result of marginal screening depends on the observed response vector $y$, we write the feature
selection process as a function in the following form

$$ \{S, \bar{S}, s\} = \mathcal{M}(y), $$
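For a design matrix small enough to hold in memory, marginal screening and the affine constraints in (5) can be written out explicitly. The sketch below is an illustration under that assumption, not the scalable procedure developed later; the function names `marginal_screening` and `selection_constraints` are ours, and the rows of `A` are stacked in a convenient order rather than any order implied by the text.

```python
import numpy as np

def marginal_screening(X, y, k):
    """Select the k columns of X with the largest absolute score |x_j^T y|."""
    scores = X.T @ y
    order = np.argsort(-np.abs(scores), kind="stable")
    S, S_bar = order[:k], order[k:]
    s = np.sign(scores[S])
    return S, S_bar, s

def selection_constraints(X, S, S_bar, s):
    """Stack the 2*k*k_bar + k rows of A so that the event reads A y <= 0."""
    rows = []
    for j, sj in zip(S, s):
        for l in S_bar:
            rows.append(-sj * X[:, j] - X[:, l])
            rows.append(-sj * X[:, j] + X[:, l])
    for j, sj in zip(S, s):
        rows.append(-sj * X[:, j])        # sign constraints
    return np.array(rows)

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 6))
y = rng.normal(size=20)
S, S_bar, s = marginal_screening(X, y, k=2)
A = selection_constraints(X, S, S_bar, s)
print(A.shape, bool(np.all(A @ y <= 1e-12)))
```

The observed $y$ always satisfies its own selection constraints, which is a useful check when implementing the constraint matrix by hand.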
where $s$ denotes the vector of signs $s_j$ for $j \in S$. The set of $2k\bar{k} + k$ constraints in (5) is written as $Ay \le b$ for a matrix
$A \in \mathbb{R}^{(2k\bar{k}+k) \times n}$ and a vector $b \in \mathbb{R}^{2k\bar{k}+k}$.¹


Post-selection inferences In the second stage, we consider a linear regression model only with the
selected features. Let $X_S \in [0,1]^{n \times k}$ be the submatrix of $X$ whose columns are indexed by $S$. The best linear
unbiased estimator of the regression coefficients is the following least-squares estimator

$$ \hat{\beta}_S := (X_S^+)^\top y, \quad \text{where } X_S^+ := X_S (X_S^\top X_S)^{-1}. \tag{6} $$

The population counterpart of (6) is written as $\beta_S := (X_S^+)^\top \mu$. Under the data generating process (3), the
distribution of $\hat{\beta}_S$ is written as

$$ \hat{\beta}_S \sim N\big((X_S^+)^\top \mu, \; (X_S^+)^\top \Sigma X_S^+\big). \tag{7} $$

If the set of features $S$ is fixed a priori, then we can make statistical inferences on $\beta_S$ by using the
sampling distribution (7). However, if $S$ is selected based on $y$, the distribution (7) no longer holds. In the PSI
framework, statistical inferences should be made based on the distribution of $\hat{\beta}_S$ conditional on the feature
selection event $\{S, \bar{S}, s\} = \mathcal{M}(y)$, i.e., we need a distributional result for the conditional random variable

$$ (X_S^+)^\top y \mid \{S, \bar{S}, s\} = \mathcal{M}(y). $$
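For a fixed index set, the least-squares estimator and its unconditional covariance can be computed as follows. This is a minimal sketch assuming $\Sigma = \sigma^2 I$ with known $\sigma$ and a full-column-rank $X_S$; the function name `ols_on_selected` is ours.

```python
import numpy as np

def ols_on_selected(X, y, S, sigma):
    """Least-squares fit on the columns S, with covariance for Sigma = sigma^2 I."""
    XS = X[:, S]
    XS_plus = XS @ np.linalg.inv(XS.T @ XS)      # n x k matrix X_S (X_S^T X_S)^{-1}
    beta_hat = XS_plus.T @ y                     # estimator (6)
    cov = sigma**2 * (XS_plus.T @ XS_plus)       # covariance in (7) when S is fixed
    return beta_hat, cov

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 5))
y = rng.normal(size=30)
beta_hat, cov = ols_on_selected(X, y, [0, 2], sigma=1.0)
print(beta_hat, np.sqrt(np.diag(cov)))
```

These standard errors are only valid when the index set was chosen without looking at $y$; using them after data-driven selection is exactly the naive analysis that PSI is meant to correct.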

The following theorem presented by Lee et al. [1, 2] enables us to make post-selection inferences as long
as the pre-feature selection event is characterized by a set of affine constraints $Ay \le b$.


Theorem 1 (Lee et al. [1, 2]). Consider a stochastic data generating process $y \sim N(\mu, \Sigma)$. If a feature
selection event is characterized by $Ay \le b$ for an arbitrary matrix $A$ and a vector $b$ that do not depend on
$y$, then, for any vector $\eta \in \mathbb{R}^n$,

$$ \left[ F^{[V^-(A,b),\, V^+(A,b)]}_{\eta^\top \mu,\; \eta^\top \Sigma \eta}(\eta^\top y) \;\middle|\; Ay \le b \right] \sim \mathrm{Unif}(0, 1), $$

where $F^{[v,w]}_{t,u}$ is the cumulative distribution function of the univariate truncated Normal distribution with
the mean $t$, the variance $u$, and the lower and the upper truncation points $v$ and $w$, respectively. Furthermore,
using $c := \Sigma \eta (\eta^\top \Sigma \eta)^{-1}$, the lower and the upper truncation points are given as

$$ V^-(A,b) := \max_{j : (Ac)_j < 0} \frac{b_j - (Ay)_j}{(Ac)_j} + \eta^\top y, \qquad V^+(A,b) := \min_{j : (Ac)_j > 0} \frac{b_j - (Ay)_j}{(Ac)_j} + \eta^\top y. $$


Theorem 1 indicates that, if we set $\eta := X_S^+ e_j$, then the sampling distribution of $\hat{\beta}_{S,j} \mid \{S, \bar{S}, s\} = \mathcal{M}(y)$
is a truncated Normal, where $e_j$ is a vector of all 0 except 1 in the $j$-th position, and $\hat{\beta}_{S,j}$ is the $j$-th element
of $\hat{\beta}_S$. If the lower truncation point $V^-(A,b)$ and the upper truncation point $V^+(A,b)$ can be computed,
we can make post-selection inferences on each coefficient of the final regression model in the second stage.

¹ In the case of marginal screening, the vector $b = 0$. However, we keep a vector $b$ here for generality: if another feature
selection method such as Lasso is used, $b \neq 0$ in general.
However, we cannot handle all the $2k\bar{k} + k$ constraints in $Ay \le b$ because $D$ is exponentially large in
our high-order interaction models. In §3, we develop an efficient algorithm by exploiting the underlying tree
structure among a set of high-order interaction features that enables us to compute the sampling distribution
of $\eta^\top y \mid Ay \le b$ even when $A$ has an exponentially large number of rows.
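When the constraint matrix is small enough to enumerate, the truncation points and the truncated Normal pivot of Theorem 1 can be evaluated directly. The following is a minimal sketch under that assumption. The helper names are hypothetical, and the two-sided p-value for the null hypothesis $\eta^\top \mu = 0$ is one common way to use the pivot, shown here as an assumption rather than as the exact test used in the experiments.

```python
import math
import numpy as np

def truncation_points(A, b, y, eta, Sigma):
    """Lower and upper truncation points of Theorem 1 for the event A y <= b."""
    c = Sigma @ eta / (eta @ Sigma @ eta)
    Ac, slack = A @ c, b - A @ y
    neg, pos = Ac < 0, Ac > 0
    lo = (slack[neg] / Ac[neg]).max() + eta @ y if neg.any() else -math.inf
    up = (slack[pos] / Ac[pos]).min() + eta @ y if pos.any() else math.inf
    return lo, up

def _phi(z):
    """Standard Normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncated_normal_cdf(x, mean, var, lo, up):
    """Cdf at x of N(mean, var) truncated to [lo, up]."""
    sd = math.sqrt(var)
    num = _phi((x - mean) / sd) - _phi((lo - mean) / sd)
    den = _phi((up - mean) / sd) - _phi((lo - mean) / sd)
    return num / den

def selective_p_value(A, b, y, eta, Sigma):
    """Two-sided p-value for H0: eta^T mu = 0 (hypothetical helper)."""
    lo, up = truncation_points(A, b, y, eta, Sigma)
    F = truncated_normal_cdf(eta @ y, 0.0, eta @ Sigma @ eta, lo, up)
    return 2.0 * min(F, 1.0 - F)

# Toy event: the first coordinate of y is non-negative.
A = np.array([[-1.0, 0.0]])
b = np.zeros(1)
y = np.array([1.0, 0.3])
eta = np.array([1.0, 0.0])
print(truncation_points(A, b, y, eta, np.eye(2)), selective_p_value(A, b, y, eta, np.eye(2)))
```

The plain difference of cdf values loses precision when the truncation interval lies far in a tail of the Normal distribution; a production implementation would need a numerically safer evaluation.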

Related works Before presenting our main contribution, we briefly review related works in the literature.
Methods for efficiently finding high-order interaction features and properly evaluating their statistical
significance have long been desired in many practical application domains. In the past decade, several authors
studied this topic in the context of sparse learning [11, 12, 13]. These methods cannot be used for statistical
inferences on the selected features because their main focus is on asymptotic feature selection consistency.
In addition, none of these works has a special computational trick for handling an exponentially large number
of interaction features, which restricts their empirical evaluations to interactions of at most second
order. One commonly used heuristic in the context of interaction modeling is to introduce prior
knowledge such as the strong heredity assumption [11, 12, 13], where, e.g., an interaction term $z_1 z_2$ would be
selected only when both $z_1$ and $z_2$ are selected. Such a heuristic restriction is helpful for reducing the
number of interaction terms to be considered. However, in many applications, scientists are primarily
interested in finding strong interaction features even when their main effects alone do not have any association
with the response. The idea of considering a structure among the features and utilizing some pruning rules
is a common technique in the data mining literature [14, 15, 16, 17]. Unfortunately, it is difficult to properly assess
the statistical significance of the features selected by these mining techniques.

One traditional approach to assessing the statistical properties of pre-selected features is multiple testing
correction (MTC). In the context of DNA microarray studies, many MTC procedures for high-dimensional
data have been proposed [18, 19]. An MTC approach for statistical evaluation of high-order interaction
features was recently studied in [20, 21]. A main drawback of MTC procedures is that they become highly conservative
when the number of candidate features increases. Another common approach is data-splitting (DS). In
the DS approach, we split the data into two subsets, and use one for feature selection and the other for model
assessment, which enables us to remove the selection bias. However, the power of the DS approach is clearly weaker
than that of the PSI framework by Lee et al. because only a part of the available sample is used for statistical
model assessment. In addition, it is undesirable that a different set of features could be selected if the data is
split differently. Although the two-stage method is frequently used in practical high-dimensional data analysis,
proper PSI methods have not been available until recently. Besides the approach by Lee et al. [1, 2], several
new directions for PSI have been studied lately [22, 23]. The main contribution of this paper is to develop a
practical algorithm for proper statistical assessment of high-order interaction features based on this recent
progress in the PSI literature.
Figure 1: An underlying tree structure among high-order interaction features ($d = 4$, $r = 3$).


3 Efficient post-selection inferences for high-order interaction models


In this section we present an efficient algorithm for statistical inferences on high-order interaction models
based on the post-selection inference framework. The basic idea is to exploit the underlying tree structure
among a set of high-order interaction terms as depicted in Figure 1. Using the tree structure we derive a set
of pruning conditions of the tree that allows us to efficiently compute the sampling distribution conditional
on the marginal screening, even when it is characterized by an exponentially large number of affine constraints.
In what follows, for any node $j$ in the tree, we let $De(j)$ be the set of all its descendant nodes. In § 3.1,
we describe a simple computational trick for marginal screening when there is an exponentially large number
of high-order interaction features. Then, in § 3.2, we present our main results on efficient post-selection
inference for high-order interaction models.


3.1 Efficient marginal screening for high-order interaction models 

In the first marginal screening stage, we select the top $k$ features according to the absolute scores $|x_j^\top y|$, $j \in [D]$.
In a naive implementation, the absolute scores for all the $D$ features are first computed, and then
the top $k$ of them are selected. The computational cost of such a naive implementation is $O(nD)$, which is
computationally intractable for our high-order interaction models. To circumvent the computational cost,
we use the following lemma.

Lemma 2. Consider high-order interaction feature vectors $x_j \in [0,1]^n$, $j \in [D]$, whose indices are represented
in the tree structure depicted in Figure 1. Then, for any node $j \in [D]$ in the tree,

$$ |x_\ell^\top y| \le \max\Big\{ \sum_{i : y^i > 0} x_j^i y^i, \; -\sum_{i : y^i < 0} x_j^i y^i \Big\} \quad \text{for all } \ell \in De(j), \tag{8} $$

where $x_j^i$ is the $(i, j)$-th element of the design matrix $X$, i.e., the $i$-th element of the vector $x_j$.

This simple lemma can be easily proved by noting that, for any $(i, j) \in [n] \times [D]$, $0 \le x_\ell^i \le x_j^i \le 1$ for all
$\ell \in De(j)$. The lemma has also been used in the context of itemset mining [15, 16, 17].
Lemma 2 suggests that we can exploit the tree structure for efficiently selecting the top $k$ features. In a
depth-first search, if the upper bound in (8) at a certain node $j$ is smaller than the $k$-th largest absolute score
obtained so far, we can quit searching over its descendant nodes because the lemma indicates that there are
no features in the subtree whose absolute score is greater than the current $k$-th largest one.
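The depth-first search with the Lemma 2 bound can be sketched as follows. This is an illustrative implementation under our own assumptions: nodes are increasing tuples of covariate indices, children extend a tuple with a larger index, and a min-heap tracks the current top $k$. The function name `top_k_interactions` is hypothetical.

```python
import heapq
import numpy as np

def top_k_interactions(Z, y, r, k):
    """Top-k interaction features by |x^T y|, pruning subtrees with the Lemma 2 bound."""
    d = Z.shape[1]
    heap = []                      # min-heap of (|score|, node); heap[0] is the k-th largest
    visited = 0
    stack = [((i,), Z[:, i]) for i in range(d)]
    pos, neg = y > 0, y < 0
    while stack:
        node, x = stack.pop()
        visited += 1
        score = abs(x @ y)
        if len(heap) < k:
            heapq.heappush(heap, (score, node))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, node))
        if len(node) == r:
            continue
        # Upper bound on |x_l^T y| over all descendants l of this node.
        bound = max(x[pos] @ y[pos], -(x[neg] @ y[neg]))
        if len(heap) == k and bound <= heap[0][0]:
            continue               # no descendant can enter the current top k
        for i in range(node[-1] + 1, d):
            stack.append((node + (i,), x * Z[:, i]))
    return sorted(heap, reverse=True), visited

rng = np.random.default_rng(0)
Z = (rng.random((40, 8)) < 0.3).astype(float)
y = rng.normal(size=40)
top, visited = top_k_interactions(Z, y, r=3, k=5)
print([node for _, node in top], visited)
```

Each child column is obtained from its parent with a single elementwise product, so the cost per visited node is $O(n)$, and the total cost is governed by how many nodes survive the pruning test. Sparse covariates tend to make the bound shrink quickly with depth.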


3.2 Efficient post-selection inference for high-order interaction models 

In this section we present our main contribution. As we saw in § 2, a marginal-screening event is represented
by $2k\bar{k} + k$ affine constraints. Theorem 1 indicates that a feature selection event characterized by such a
set of affine constraints $Ay \le b$ changes the sampling distribution of the post-regression model through the
dependencies of the lower and the upper truncation points $V^-(A,b)$ and $V^+(A,b)$ on the matrix $A$ and
the vector $b$. Our basic idea is to efficiently identify a subset of affine constraints (a subset of the rows in $A$
and the elements in $b$) that have no influence on the lower and the upper truncation points by using a set
of pruning conditions in the tree structure.

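Theorem 3. Let $\theta \in [2k\bar{k}]$ be the index of the first $2k\bar{k}$ affine constraints in (5), and let $C_1 := \{1, \ldots, k\bar{k}\}$
and $C_2 := \{k\bar{k} + 1, \ldots, 2k\bar{k}\}$. Furthermore, for notational simplicity, assume that the first $k$ features are selected
and the remaining $\bar{k} = D - k$ features are unselected. Then, aside from the sign constraints $s_j x_j^\top y \ge 0$, $j \in S$,
a marginal screening event $Ay \le b$ in (5) is written as

$$ (-s_{j(\theta)} x_{j(\theta)} + x_{\ell(\theta)})^\top y \le 0 \quad \text{with } j(\theta) := \lceil \theta / \bar{k} \rceil, \; \ell(\theta) := k + (\theta \bmod \bar{k}) \quad \text{for } \theta \in C_1, $$

$$ (-s_{j(\theta)} x_{j(\theta)} - x_{\ell(\theta)})^\top y \le 0 \quad \text{with } j(\theta) := \lceil (\theta - k\bar{k}) / \bar{k} \rceil, \; \ell(\theta) := k + (\theta \bmod \bar{k}) \quad \text{for } \theta \in C_2. $$

Then, the lower and the upper truncation points in Theorem 1 are written as

$$ V^-(A,b) = \max\Big\{ \max_{\substack{\theta \in [2k\bar{k}] \\ \rho_{j(\theta)} + x_{\ell(\theta)}^\top \chi_\theta < 0}} \frac{\kappa_{j(\theta)} + x_{\ell(\theta)}^\top \xi_\theta}{\rho_{j(\theta)} + x_{\ell(\theta)}^\top \chi_\theta}, \; \max_{\substack{j \in S \\ \rho_j < 0}} \frac{\kappa_j}{\rho_j} \Big\} + \eta^\top y, \tag{9a} $$

$$ V^+(A,b) = \min\Big\{ \min_{\substack{\theta \in [2k\bar{k}] \\ \rho_{j(\theta)} + x_{\ell(\theta)}^\top \chi_\theta > 0}} \frac{\kappa_{j(\theta)} + x_{\ell(\theta)}^\top \xi_\theta}{\rho_{j(\theta)} + x_{\ell(\theta)}^\top \chi_\theta}, \; \min_{\substack{j \in S \\ \rho_j > 0}} \frac{\kappa_j}{\rho_j} \Big\} + \eta^\top y, \tag{9b} $$

where, for $j \in [D]$ and $\theta \in [2k\bar{k}]$,

$$ \kappa_j := s_j x_j^\top y, \quad \rho_j := -s_j x_j^\top c, \quad \xi_\theta := \begin{cases} -y & \text{if } \theta \in C_1, \\ y & \text{if } \theta \in C_2, \end{cases} \quad \chi_\theta := \begin{cases} c & \text{if } \theta \in C_1, \\ -c & \text{if } \theta \in C_2, \end{cases} $$

with $c = \Sigma \eta (\eta^\top \Sigma \eta)^{-1}$ as defined before. Furthermore, let

$$ a^+_{\ell(\theta)} := \sum_{i : \xi_\theta^i > 0} x_{\ell(\theta)}^i |\xi_\theta^i|, \quad a^-_{\ell(\theta)} := \sum_{i : \xi_\theta^i < 0} x_{\ell(\theta)}^i |\xi_\theta^i|, \quad b^+_{\ell(\theta)} := \sum_{i : \chi_\theta^i > 0} x_{\ell(\theta)}^i |\chi_\theta^i|, \quad b^-_{\ell(\theta)} := \sum_{i : \chi_\theta^i < 0} x_{\ell(\theta)}^i |\chi_\theta^i|, $$

where $\xi_\theta^i$ and $\chi_\theta^i$ are the $i$-th elements of $\xi_\theta$ and $\chi_\theta$, respectively.

For each of the selected features $j \in S$, consider a tree structure as depicted in Figure 1 which only has a
set of nodes corresponding to each of the unselected features $\ell \in \bar{S}$. Considering a tree for a selected feature
$j \in S$, if a node corresponding to $\ell(\theta)$, $\theta \in \{\theta \mid j(\theta) = j\}$, satisfies

$$ V^-_{\mathrm{best}} \ge \begin{cases} -\dfrac{\kappa_{j(\theta)} - a^-_{\ell(\theta)}}{|\rho_{j(\theta)} - b^-_{\ell(\theta)}|} + \eta^\top y & \text{if } \rho_{j(\theta)} + b^+_{\ell(\theta)} < 0, \\[2ex] -\dfrac{\kappa_{j(\theta)} - a^-_{\ell(\theta)}}{\max\{|\rho_{j(\theta)} - b^-_{\ell(\theta)}|, \, |\rho_{j(\theta)} + b^+_{\ell(\theta)}|\}} + \eta^\top y & \text{otherwise,} \end{cases} \tag{10} $$

then all the constraints indexed by $\theta'$ such that $\ell(\theta')$ is a descendant of $\ell(\theta)$ in the tree are guaranteed to
have no influence on the lower truncation point $V^-(A,b)$, where $V^-_{\mathrm{best}}$ is the current maximum of $V^-(A,b)$
in (9a). Similarly, if

$$ V^+_{\mathrm{best}} \le \begin{cases} \dfrac{\kappa_{j(\theta)} - a^-_{\ell(\theta)}}{|\rho_{j(\theta)} + b^+_{\ell(\theta)}|} + \eta^\top y & \text{if } \rho_{j(\theta)} - b^-_{\ell(\theta)} > 0, \\[2ex] \dfrac{\kappa_{j(\theta)} - a^-_{\ell(\theta)}}{\max\{|\rho_{j(\theta)} - b^-_{\ell(\theta)}|, \, |\rho_{j(\theta)} + b^+_{\ell(\theta)}|\}} + \eta^\top y & \text{otherwise,} \end{cases} \tag{11} $$

then all the constraints indexed by $\theta'$ such that $\ell(\theta')$ is a descendant of $\ell(\theta)$ in the tree are guaranteed to
have no influence on the upper truncation point $V^+(A,b)$, where $V^+_{\mathrm{best}}$ is the current minimum of $V^+(A,b)$
in (9b).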


The proof is presented in Supplementary Appendix A. Note that (10) and (11) can be evaluated by using
information available at the node $\ell(\theta)$. If the conditions in (10) or (11) are satisfied, we can stop searching
the subtree below that node because it is guaranteed that any constraints indexed by $\theta'$ such that $\ell(\theta') \in De(\ell(\theta))$ do not have
any influence on the corresponding truncation point, and hence do not affect the sampling distribution for PSI.
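To make the pruning idea concrete, the sketch below computes the part of the lower truncation point that comes from the unselected-feature constraints, once by visiting every node and once with a subtree bound. It is our own illustration, not the authors' implementation. It assumes $\Sigma = \sigma^2 I$ (so $c = \eta / \eta^\top \eta$), $b = 0$, nodes indexed by increasing tuples of covariate indices, and a bound derived from the fact that descendant features are elementwise no larger than their ancestor. The names `column` and `lower_ratio` are hypothetical.

```python
import itertools
import numpy as np

def column(Z, node):
    """Interaction feature for a tuple of covariate indices."""
    return Z[:, list(node)].prod(axis=1)

def lower_ratio(Z, y, r, selected, signs, c, prune=True):
    """Max of -(a^T y)/(a^T c) over unselected-feature constraints with a^T c < 0.

    Returns (max ratio, number of nodes visited); add eta @ y to obtain V^-.
    """
    d = Z.shape[1]
    chosen = set(selected)
    best, visited = -np.inf, 0
    for node_j, s in zip(selected, signs):
        xj = column(Z, node_j)
        kappa, rho = s * (xj @ y), -s * (xj @ c)
        for xi, chi in ((-y, c), (y, -c)):        # the two constraint families in (5)
            stack = [(i,) for i in range(d)]
            while stack:
                node = stack.pop()
                visited += 1
                x = column(Z, node)
                num, den = kappa + x @ xi, rho + x @ chi
                if node not in chosen and den < 0:
                    best = max(best, num / den)
                if len(node) == r:
                    continue
                if prune:
                    a_minus = -(x @ np.minimum(xi, 0))
                    b_plus = x @ np.maximum(chi, 0)
                    b_minus = -(x @ np.minimum(chi, 0))
                    if rho + b_plus < 0:
                        scale = abs(rho - b_minus)
                    else:
                        scale = max(abs(rho - b_minus), abs(rho + b_plus))
                    # Every descendant ratio is at most -(kappa - a_minus) / scale.
                    if scale == 0 or best >= -(kappa - a_minus) / scale:
                        continue
                stack.extend(node + (i,) for i in range(node[-1] + 1, d))
    return best, visited
```

Calling `lower_ratio(..., prune=False)` gives a brute-force reference, so the two results can be compared on a small problem before trusting the pruned search on a large one. The sign constraints of the selected features contribute only $k$ further ratios and can be handled separately.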


4 Experiments 

4.1 Experiments on synthetic data 

First, we checked the validity of our post-selection inference algorithm for high-order interaction models
by using synthetic data. In the synthetic data experiments, we compared our approach (denoted as PSI:
Post-Selection Inference) with the ordinary least-squares method (OLS) and the data-splitting method (Split). In
the data-splitting method, the data set was randomly divided into two equal-sized subsets; one of them was
used for feature selection, while the other was used for statistical inference on the post-regression model.

The synthetic data was generated from $y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$, where $y \in \mathbb{R}^n$ is the response vector,
$X \in \{0,1\}^{n \times D}$ is the design matrix, and $\varepsilon \in \mathbb{R}^n$ is the Gaussian noise vector. Here, we did not actually
compute the extremely wide design matrix $X$ because it has an exponentially large number of columns. Instead,
we generated a random binary matrix $Z \in \{0,1\}^{n \times d}$, and each expanded high-order interaction feature $x_j^i$
was generated from the $i$-th row of $Z$ only when it was needed. For simplicity and computational efficiency,
we assumed that the covariates (and hence the interaction features as well) are binary, and the sparsity rate $\eta \in [0,1]$
(the rate of zeros in the entries of $Z$) was varied to see how sparsity helps efficient computation. As
the baseline, the parameters were set as $n = 100$, $d = 100$, $\eta = 0.5$, $\sigma = 0.1$, $r = 3$, $k = 10$, and
significance level $\alpha = 0.05$.
Figure 2: False positive rates of three methods PSI, OLS and Split.
Figure 3: True positive rates of PSI and Split.


(c) $r \in \{1, \ldots, 5\}$


False positive rate control First, to check whether the methods can properly control the desired false
positive rate, we generated data sets with $\beta = 0$ and observed how many false positives were reported by each
of the three methods. Figure 2 shows the false positive rate defined as $k'/k$ where $k'$ is the number of features
reported as positives in post-regression models. The plots in Figure 2 are the averages over 1000 different
trials for various sample sizes $n \in \{50, \ldots, 250\}$, noise levels $\sigma \in \{0.1, \ldots, 0.5\}$, and maximum interaction
orders $r \in \{1, \ldots, 5\}$. As expected, OLS could not properly control the false positive rates because statistical
inferences on the post-regression models are positively biased when the features are selected by using
the same data set. On the other hand, PSI and Split could keep the false positive rates at the desired level
$\alpha = 0.05$.
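The selection bias of the naive analysis, and its removal by data splitting, can be reproduced with a much smaller simulation than the one reported above. The sketch below is an illustration under simplifying assumptions that are ours: main effects only ($r = 1$), a single selected feature ($k = 1$), known $\sigma$, and a two-sided z-test at the 0.05 level (critical value 1.96). The function name `false_positive_rates` is hypothetical.

```python
import numpy as np

def false_positive_rates(n=100, d=50, density=0.3, sigma=1.0, trials=300, seed=0):
    """Null simulation (beta = 0): fraction of trials in which the top feature is rejected."""
    rng = np.random.default_rng(seed)
    half = n // 2
    naive = split = 0

    def z_stat(Xa, ya, j):
        # z statistic of the single-feature, no-intercept least-squares coefficient.
        x = Xa[:, j]
        return (x @ ya) / (sigma * np.linalg.norm(x))

    for _ in range(trials):
        X = (rng.random((n, d)) < density).astype(float)
        y = sigma * rng.normal(size=n)
        # Naive: select and test on the same data.
        j = np.argmax(np.abs(X.T @ y))
        naive += abs(z_stat(X, y, j)) > 1.96
        # Split: select on the first half, test on the held-out second half.
        j = np.argmax(np.abs(X[:half].T @ y[:half]))
        split += abs(z_stat(X[half:], y[half:], j)) > 1.96
    return naive / trials, split / trials

naive_rate, split_rate = false_positive_rates()
print(naive_rate, split_rate)
```

In the split analysis the held-out half is independent of the selection, so the z statistic is exactly standard Normal under the null. In the naive analysis the same noise that made the feature look best is reused in the test, which inflates the rejection rate well beyond the nominal level.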

True positive rate comparison Next, we compared the true positive rates of PSI and Split. We set all
the coefficients to 0 except $\beta_j = 1$ for the feature corresponding to the third-order interaction term $z_1 z_2 z_3$. In
Figure 3, we report the true positive rates for various parameters, which are defined as the number of times
the feature $z_1 z_2 z_3$ was detected as positive divided by the total number of trials. In almost all cases, we see
that PSI has larger true positive rates than Split. This is reasonable because the sample size used in the
post-selection inference in the latter was half of that in the former.
Computational efficiency Finally, we demonstrate the computational efficiency of the proposed method.
In Tables 1 and 2, we show the average computation times in seconds over 10 trials and the $1 - $ pruning rates for
tree traversal with various values of those parameters. Although the computation time increases with $d$, $r$ and $1 - \eta$,
the computation times were still in an acceptable range.

Table 1: Computation times [sec] and 1 - pruning rates (r = 3).

d       Time (η = 0.9)   Time (η = 0.7)   1 - pruning rate (η = 0.9)      1 - pruning rate (η = 0.7)
100     0.01 (0.002)     0.01 (0.007)     1.29 x 10^-? (3.42 x 10^-?)     1.76 x 10^-? (1.08 x 10^-?)
500     0.08 (0.02)      0.16 (0.11)      9.71 x 10^-? (2.63 x 10^-?)     1.79 x 10^-? (1.30 x 10^-?)
1000    0.19 (0.05)      0.52 (0.37)      2.91 x 10^-? (9.77 x 10^-?)     7.11 x 10^-? (5.49 x 10^-?)
5000    1.03 (0.33)      3.36 (2.92)      1.18 x 10^-? (3.73 x 10^-?)     3.67 x 10^-? (3.28 x 10^-?)
10000   2.70 (1.09)      9.50 (9.20)      4.12 x 10^-? (1.45 x 10^-?)     1.39 x 10^-? (1.32 x 10^-?)

Table 2: Computation times [sec] and 1 - pruning rates (d = 5000).

r   Time (η = 0.9)   Time (η = 0.7)   1 - pruning rate (η = 0.9)        1 - pruning rate (η = 0.7)
1   0.03 (0.001)     0.03 (0.001)     1 (0)                             1 (0)
2   1.01 (0.32)      2.59 (2.23)      1.96 x 10^-? (6.23 x 10^-?)       5.20 x 10^-? (4.54 x 10^-?)
3   1.05 (0.33)      3.24 (2.78)      1.18 x 10^-? (3.73 x 10^-?)       3.67 x 10^-? (3.28 x 10^-?)
4   1.04 (0.33)      3.98 (3.00)      9.47 x 10^-? (2.99 x 10^-?)       4.03 x 10^-? (3.39 x 10^-?)
5   1.03 (0.32)      4.17 (3.35)      9.48 x 10^-?? (2.99 x 10^-??)     4.04 x 10^-?? (3.39 x 10^-??)

Table 3: The number of features reported as positives and computation times [sec].

                                   Split               PSI                 time of PSI
Dataset                            1st   2nd   3rd     1st   2nd   3rd
Communities&CrimeOO (d = 253)      1     -     -       2     -     -       1.48
Communities&Crimell (d = 289)      -     2     -       3     2     1       1.46
BlogFeedback (d = 339)             -     2     1       4     12    13      2.17
SliceLocalization (d = 769)        -     2     28      -     1     29      26.33
UJIIndoorLoc (d = 1053)            3     4     -       5     4     1       4.29

4.2 Experiments on real data 

Here we show the statistical power of PSI and Split on real data. Since the true positive features in real data
are unknown, we show the number of features reported as positives in post-regression models, assuming that
these two methods can properly control false positive rates as we confirmed in the synthetic data experiments.
We obtained datasets from the UCI data repository, which are listed in the first column of Table 3. Continuous
covariates in the original datasets were first standardized to have mean zero and variance one, and
each covariate was then represented by two binary variables, which indicate whether the value is greater
than 1 and whether the value is smaller than -1, respectively. We estimated $\sigma$ in the same way as [1]. We set the maximum
interaction order as $r = 3$ and the number of features selected by marginal screening as $k = 30$. For simplicity
and computational efficiency, we randomly sampled 1000 instances from each dataset. Table 3 shows the
number of features reported as positives by PSI and Split. In almost all cases, PSI found more positive
features than Split, while the computational cost of PSI was still in an acceptable range.
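The binarization of continuous covariates described above can be sketched as follows. The details are assumptions on our part: the population standard deviation is used for standardization, the output places all "greater than 1" indicators before all "smaller than -1" indicators, and constant columns (zero variance) are not handled. The function name `binarize` is hypothetical.

```python
import numpy as np

def binarize(C):
    """Standardize each column of C, then emit indicators for > 1 and < -1."""
    Zs = (C - C.mean(axis=0)) / C.std(axis=0)
    high = (Zs > 1).astype(float)
    low = (Zs < -1).astype(float)
    return np.hstack([high, low])

C = np.array([[0.0], [0.0], [0.0], [0.0], [10.0]])
print(binarize(C))
```

Each continuous covariate therefore yields two binary covariates, so an interaction term such as the product of a "high" indicator and a "low" indicator reads as the co-occurrence of an unusually large value of one variable and an unusually small value of another.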





5 Conclusions 


In this paper we proposed an efficient PSI method for high-order interaction models with marginal screening-based
pre-feature selection. Our key idea is to derive a pruning condition of the tree that quickly identifies a set
of features that have no influence on PSI. The experimental results indicated that the proposed method allows us to
reliably identify statistically significant high-order interaction features with reasonable computational cost.
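A Proof of Theorem 3


In this appendix we prove Theorem 3.

Proof of Theorem 3. We only show the lower truncation point part of the theorem. Consider an arbitrary
pair $(\theta, \theta') \in [2k\bar{k}]^2$ such that $j(\theta) = j(\theta')$ and $\ell(\theta') \in De(\ell(\theta))$. We first note that the fact that $0 \le x^i_{\ell(\theta')} \le x^i_{\ell(\theta)}$
for all $i \in [n]$ indicates that

$$ 0 \le a^+_{\ell(\theta')} \le a^+_{\ell(\theta)}, \quad 0 \le a^-_{\ell(\theta')} \le a^-_{\ell(\theta)}, \quad 0 \le b^+_{\ell(\theta')} \le b^+_{\ell(\theta)}, \quad 0 \le b^-_{\ell(\theta')} \le b^-_{\ell(\theta)}. \tag{12} $$

(i) First we prove the first case in (10). Using the relations in (12), we have

$$ \rho_{j(\theta')} + x_{\ell(\theta')}^\top \chi_{\theta'} = \rho_{j(\theta')} + b^+_{\ell(\theta')} - b^-_{\ell(\theta')} \le \rho_{j(\theta)} + b^+_{\ell(\theta)}, \tag{13} $$

where we used $\rho_{j(\theta')} = \rho_{j(\theta)}$. Next, when $\rho_{j(\theta')} + x_{\ell(\theta')}^\top \chi_{\theta'} < 0$,

$$ \frac{\kappa_{j(\theta')} + x_{\ell(\theta')}^\top \xi_{\theta'}}{\rho_{j(\theta')} + x_{\ell(\theta')}^\top \chi_{\theta'}} = \frac{\kappa_{j(\theta')} + a^+_{\ell(\theta')} - a^-_{\ell(\theta')}}{\rho_{j(\theta')} + b^+_{\ell(\theta')} - b^-_{\ell(\theta')}} \le -\frac{\kappa_{j(\theta')} - a^-_{\ell(\theta')}}{|\rho_{j(\theta')} - b^-_{\ell(\theta')}|} \le -\frac{\kappa_{j(\theta)} - a^-_{\ell(\theta)}}{|\rho_{j(\theta)} - b^-_{\ell(\theta)}|}, \tag{14} $$

where we used the fact that the numerator is non-negative, and the denominator is non-positive in the
left-most fraction. From (13) and (14), we have

$$ \rho_{j(\theta)} + b^+_{\ell(\theta)} < 0 \ \text{ and } \ -\frac{\kappa_{j(\theta)} - a^-_{\ell(\theta)}}{|\rho_{j(\theta)} - b^-_{\ell(\theta)}|} + \eta^\top y \le V^-_{\mathrm{best}} \quad \Rightarrow \quad \frac{\kappa_{j(\theta')} + x_{\ell(\theta')}^\top \xi_{\theta'}}{\rho_{j(\theta')} + x_{\ell(\theta')}^\top \chi_{\theta'}} + \eta^\top y \le V^-_{\mathrm{best}}, $$

which proves the first case in (10). (ii) Next, we prove the second case of (10). When we do not know the
sign of the denominator $\rho_{j(\theta')} + x_{\ell(\theta')}^\top \chi_{\theta'}$, we can obtain a slightly looser bound in the following form

$$ \frac{\kappa_{j(\theta')} + x_{\ell(\theta')}^\top \xi_{\theta'}}{\rho_{j(\theta')} + x_{\ell(\theta')}^\top \chi_{\theta'}} = \frac{\kappa_{j(\theta')} + a^+_{\ell(\theta')} - a^-_{\ell(\theta')}}{\rho_{j(\theta')} + b^+_{\ell(\theta')} - b^-_{\ell(\theta')}} \le -\frac{\kappa_{j(\theta')} - a^-_{\ell(\theta')}}{\max\{|\rho_{j(\theta')} - b^-_{\ell(\theta')}|, \, |\rho_{j(\theta')} + b^+_{\ell(\theta')}|\}} \le -\frac{\kappa_{j(\theta)} - a^-_{\ell(\theta)}}{\max\{|\rho_{j(\theta)} - b^-_{\ell(\theta)}|, \, |\rho_{j(\theta)} + b^+_{\ell(\theta)}|\}}. \tag{15} $$

From (15),

$$ -\frac{\kappa_{j(\theta)} - a^-_{\ell(\theta)}}{\max\{|\rho_{j(\theta)} - b^-_{\ell(\theta)}|, \, |\rho_{j(\theta)} + b^+_{\ell(\theta)}|\}} + \eta^\top y \le V^-_{\mathrm{best}} \quad \Rightarrow \quad \frac{\kappa_{j(\theta')} + x_{\ell(\theta')}^\top \xi_{\theta'}}{\rho_{j(\theta')} + x_{\ell(\theta')}^\top \chi_{\theta'}} + \eta^\top y \le V^-_{\mathrm{best}}, $$

which proves the second case of (10). Combining (i) and (ii), we showed that, if (10) is satisfied for a certain
$\theta$, then any constraints indexed by $\theta'$ such that $j(\theta') = j(\theta)$ and $\ell(\theta') \in De(\ell(\theta))$ are guaranteed to have no
influence on the lower truncation point $V^-(A, b)$. The upper truncation point part of the theorem can be
shown similarly. ■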